AITopics | attention distribution

Collaborating Authors

attention distribution

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

b2c4b7d34b3d96b9dc12f7bce424b7ae-Paper-Conference.pdf

Neural Information Processing SystemsJun-22-2026, 00:46:27 GMT

Attention sink (AS) is a consistent pattern in transformer attention maps where certain tokens (often special tokens or positional anchors) disproportionately attract attention from other tokens. We show that in transformers, AS is not an architectural artifact, but it is the manifestation of a fundamental geometric principle: the establishment of reference frames that anchor representational spaces. We analyze several architectures and identify three distinct reference frame types, centralized, distributed, and bidirectional, that correlate with the attention sink phenomenon. We show that they emerge during the earliest stages of training as optimal solutions to the problem of establishing stable coordinate systems in high-dimensional spaces. We show the influence of architecture components, particularly position encoding implementations, on the specific type of reference frame. This perspective transforms our understanding of transformer attention mechanisms and provides insights for both architecture design and the relationship with AS.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Quaternion Self-Attention with Shared Scores

Yamauchi, Shogo, Nitta, Tohru, Tamori, Hideaki

arXiv.org Machine LearningMay-26-2026

Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a single entity. However, existing quaternion self-attention computes component-wise scores and applies independent softmax operations to each component, which increases the computational cost and allows attention distributions to diverge across components. We propose a shared-score quaternion self-attention mechanism that computes a single real-valued score using the quaternion inner product and applies a shared attention distribution across all components. This reduces score-computation multiplications by 75% and the number of softmax operations from four to one. We prove that, when queries and keys are produced by quaternion linear projections that induce component pre-mixing, the component-wise and shared scores lie in the same interaction subspace, indicating that independent component-wise attention primarily re-parameterizes the same interactions rather than expanding the feature interaction space. In speech enhancement, our method reduces inference time by up to 44.3% on a GPU and 58.1% on a CPU while maintaining quality, with consistent trends across vision and natural language processing.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2605.2492

Country:

Asia > Japan (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)

Add feedback

Generalizable Multi-Linear Attention Network

Neural Information Processing SystemsApr-25-2026, 18:55:21 GMT

The majority of existing multimodal sequential learning methods focus on how to obtain powerful individual representations and neglect to effectively capture the multimodal joint representation. Bilinear attention network (BAN) is a commonly used integration method, which leverages tensor operations to associate the features of different modalities. However, BAN has a poor compatibility for more modalities, since the computational complexity of the attention map increases exponentially with the number of modalities. Based on this concern, we propose a new method called generalizable multi-linear attention network (MAN), which can associate more modalities in acceptable complexity with hierarchical approximation decomposition. Specifically, considering the fact that softmax attention kernels cannot be decomposed as linear operation directly, we adopt the addition random features mechanism to approximate the non-linear softmax functions with enough theoretical analysis. Furthermore, we also introduce the local sequential constraints, which can be combined with ARF conveniently, as positional information. We conduct extensive experiments on several datasets of corresponding tasks, the experimental results show that MAN could achieve competitive results compared with baseline methods, showcasing the effectiveness of our contributions.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.48)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language Models

Neural Information Processing SystemsMar-17-2026, 21:01:43 GMT

Recent advances in prompt optimization have notably enhanced the performance of pre-trained language models (PLMs) on downstream tasks. However, the potential of optimized prompts on domain generalization has been under-explored. To explore the nature of prompt generalization on unknown domains, we conduct pilot experiments and find that (i) Prompts gaining more attention weight from PLMs' deep layers are more generalizable and (ii) Prompts with more stable attention distributions in PLMs' deep layers are more generalizable. Thus, we offer a fresh objective towards domain-generalizable prompts optimization named ''Concentration'', which represents the ''lookback'' attention from the current decoding token to the prompt tokens, to increase the attention strength on prompts and reduce the fluctuation of attention distribution.We adapt this new objective to popular soft prompt and hard prompt optimization methods, respectively. Extensive experiments demonstrate that our idea improves comparison prompt optimization methods by 1.42% for soft prompt generalization and 2.16% for hard prompt generalization in accuracy on the multi-source domain generalization setting, while maintaining satisfying in-domain performance. The promising results validate the effectiveness of our proposed prompt optimization objective and provide key insights into domain-generalizable prompts.

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

Bilinear Attention Networks

Neural Information Processing SystemsMar-16-2026, 23:26:31 GMT

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight-attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-arts on both datasets.

artificial intelligence, natural language, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.61)

Add feedback

FastTransformerswithClusteredAttention SupplementaryMaterial

Neural Information Processing SystemsFeb-11-2026, 03:58:47 GMT

WefirstclusterthequeriesQusingtheK-means clustering to outputS which indicates the membership of queries to different clusters. The lower half of the figure shows the new valueˆVt computed by sparse dot-products with the keysK and values V corresponding tothe the top-k keys inT. Figure 6: We show training/validation loss convergence for different transformer variants. Both the clustered variants are have a significantly better convergence than bothlsh-1 and lsh-4. Note that due to a smaller batch sizefullmakesmanymoreupdates than allother transformer variants. In figure 6a, we show the training loss convergence for different transformer variants.

artificial intelligence, machine learning, variant, (17 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland (0.06)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.49)

Add feedback

f6a8dd1c954c8506aadc764cc32b895e-Paper.pdf

Neural Information Processing SystemsFeb-11-2026, 03:58:40 GMT

Clustered attention makes use of similarities between queries and groups them in order to reduce the computational cost. In particular, we perform fast clustering using locality-sensitive hashing and K-Means and only compute the attention once per cluster.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.51)

Add feedback

8c3c666820ea055a77726d66fc7d447f-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing SystemsFeb-10-2026, 16:01:51 GMT

dataset, graph, shortest-path distance, (14 more...)

Neural Information Processing Systems

Country: North America > Canada > Quebec (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

A Data Collection and Details about the

Neural Information Processing SystemsFeb-10-2026, 11:13:31 GMT

We collected about 30 million text-image pairs from multiple channels, and built a 2.5TB new dataset (after tokenization, the size becomes about 250GB). The sources of data are basically classified into the following categories: (1) Professional image websites (both English and Chinese). The images in the websites are usually with captions. We have already introduced tokenizers in section 2.2, and here are some details. Colored grids are all the tokens attended to by the token marked "O".

caption, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Technology: